62 research outputs found

    Protein sequence classification using feature hashing

    Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions; hence, dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification: the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space using a hash function, i.e., by mapping features to hash keys, where multiple features can be mapped (at random) to the same hash key and their counts aggregated. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
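    A minimal sketch of the idea described above, not the paper's exact implementation: k-gram counts are folded into a fixed-size vector by hashing each k-gram to one of `dim` buckets and aggregating the counts of colliding k-grams. The function name and parameters here are illustrative only.

```python
import hashlib

def hashed_kgram_vector(sequence, k=3, dim=1024):
    """Map a protein sequence to a dim-dimensional hashed k-gram count vector."""
    vec = [0] * dim
    for i in range(len(sequence) - k + 1):
        kgram = sequence[i:i + k]
        # Stable hash of the k-gram; distinct k-grams may collide in the
        # same bucket, in which case their counts are aggregated.
        bucket = int(hashlib.md5(kgram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec

# Toy protein fragment: 20 overlapping 3-grams hashed into 1024 buckets.
print(sum(hashed_kgram_vector("MKTAYIAKQRQISFVKSHFSRQ", k=3, dim=1024)))
```

    Some feature-hashing variants additionally use a second hash to assign each feature a ±1 sign, which reduces collision bias; that refinement is omitted here for brevity.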

    Notes on the taxonomy, geography, and ecology of the piliferous Campylopus species in the Netherlands and N.W. Germany

    Copyright information: taken from "Glycosylation site prediction using ensembles of Support Vector Machine classifiers", BMC Bioinformatics 2007;8:438 (http://www.biomedcentral.com/1471-2105/8/438), published online 9 Nov 2007, PMCID: PMC2220009. […] of the data extracted from the original glycoprotein sequence dataset for C-linked glycosylation using local sequence identity with the 0/1 String Kernel.

    Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models

    Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data.

    Results: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach; and (iii) a co-training based approach to semi-supervised training of MMs, both of which make use of unlabeled data.

    Conclusions: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.
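    A minimal sketch of the supervised Markov model (MM) baseline that the paper compares against; the class name, smoothing scheme, and details are assumptions of this illustration, not taken from the paper. One first-order Markov chain over amino acids is fit per localization class, and a sequence is assigned to the class with the highest log-likelihood. The AAMM additionally learns abstractions over the alphabet from unlabeled data, which this sketch omits.

```python
import math
from collections import defaultdict

class MarkovClassifier:
    """Per-class first-order Markov chains with additive smoothing."""

    def __init__(self, alphabet="ACDEFGHIKLMNPQRSTVWY", smoothing=1.0):
        self.alphabet = alphabet
        self.smoothing = smoothing
        self.counts = {}  # class -> {(prev, cur): count}

    def fit(self, sequences, labels):
        for seq, y in zip(sequences, labels):
            trans = self.counts.setdefault(y, defaultdict(float))
            for prev, cur in zip(seq, seq[1:]):
                trans[(prev, cur)] += 1.0
        return self

    def _log_likelihood(self, seq, trans):
        ll = 0.0
        for prev, cur in zip(seq, seq[1:]):
            num = trans[(prev, cur)] + self.smoothing
            den = (sum(trans[(prev, a)] for a in self.alphabet)
                   + self.smoothing * len(self.alphabet))
            ll += math.log(num / den)
        return ll

    def predict(self, seq):
        # Pick the class whose chain assigns the sequence the highest likelihood.
        return max(self.counts, key=lambda y: self._log_likelihood(seq, self.counts[y]))

# Toy usage with made-up sequences and labels.
clf = MarkovClassifier().fit(["MKTAYIAK", "MSSHHHHH"], ["cytoplasm", "nucleus"])
print(clf.predict("MKTAYLAK"))
```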

    A Graphical Model for Shallow Parsing Sequences

    In this paper we explore the space between these two extremes with a model that does not attempt a global characterisation of the sequence, as in the case of PCFGs/HMMs, yet does not assume independence among the generative processes of the subsequent elements in the sequence.

    Structural induction: towards automatic ontology elicitation

    Induction is the process by which we obtain laws (and, more encompassing, theories) about the world. This process can be thought of as aiming to derive two aspects of a theory: a Structural aspect and a Numerical aspect. The Structural aspect is concerned with the entities modeled and their interrelationships, also known as an ontology. The Numerical aspect is concerned with the quantities involved in the relationships among the above-mentioned entities, along with uncertainties either postulated to exist in the world or inherent to the nature of the induction process.

    In this thesis we focus on the structural aspect, hence the name: Structural Induction: Towards Automatic Ontology Elicitation. In order to deal with the problem of Structural Induction we need to solve two main problems: (1) we have to say what we mean by Structure (What?); and (2) we have to say how to get it (How?). In this thesis we give one very definite answer to the first question (What?) and explore how to answer the second question (How?) in some particular cases. A comprehensive answer to the second question in the most general setup would involve dealing very carefully with the interplay between the Structural and Numerical aspects of Induction and would represent a full solution to the Induction problem. This is a vast enterprise, and we are only able to touch on some aspects of the issue.

    The main thesis presented in this work is that the fundamental structural elements from which theories are constructed are Abstraction (grouping similar entities under one overarching category) and Super-Structuring (grouping topologically close entities, in particular spatio-temporally close ones, into a bigger unit). This thesis is supported by showing that each member of the Turing-equivalent class of General Generative Grammars can be decomposed in terms of these operators and their duals (Reverse Abstraction and Reverse Super-Structuring, respectively). Thus, if we accept the Computationalistic Assumption (that the most general way to present a finite theory is by means of an entity expressed in a Turing-equivalent formalism), we have proved that our thesis is correct. We call this the Abstraction + Super-Structuring thesis. The rest of the thesis is concerned with issues opened by the second question presented above (How?): given that we have established what we mean by Structure, how do we get it?
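    A hedged toy illustration of the two operators in grammar-rule form; the function names and the example grammar are inventions of this sketch, not the thesis'. Read as production rules, abstraction corresponds to an OR-rule (one category covering several alternatives) and super-structuring to a composition rule (one unit covering a sequence of adjacent parts).

```python
def abstraction(category, alternatives):
    """Abstraction: category -> alt1 | alt2 | ... (grouping similar entities)."""
    return [(category, (alt,)) for alt in alternatives]

def super_structuring(unit, parts):
    """Super-structuring: unit -> part1 part2 ... (grouping adjacent entities)."""
    return [(unit, tuple(parts))]

# Toy grammar fragment: "Pet" abstracts over cat/dog; "NP" super-structures
# a determiner followed by a Pet.
rules = abstraction("Pet", ["cat", "dog"]) + super_structuring("NP", ["the", "Pet"])
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs))
```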

    Fourier Neural Networks

    A new kind of neuron model that has a Fourier-like IN/OUT function is introduced. The model is discussed in a general theoretical framework and some completeness theorems are presented. Current experimental results show that the new model outperforms, by a large margin in both representational power and convergence speed, the classical mathematical model of the neuron based on a weighted sum of inputs filtered by a nonlinear function. The new model is also appealing from a neurophysiological point of view because it produces a more realistic representation by considering the inputs as oscillations.
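    A minimal sketch, assuming the natural reading of the abstract rather than the paper's exact formulation: instead of a nonlinearity applied to a weighted sum, the unit's output is a sum of cosine terms with learnable frequencies, phases, and amplitudes, giving it a Fourier-like IN/OUT function.

```python
import numpy as np

def fourier_neuron(x, freqs, phases, amps):
    """y = sum_k amps[k] * cos(freqs[k] . x + phases[k])."""
    return np.sum(amps * np.cos(freqs @ x + phases))

# Toy instance: 4 cosine components over a 3-dimensional input
# (all parameters random here; in training they would be learned).
rng = np.random.default_rng(0)
freqs = rng.normal(size=(4, 3))            # per-component frequency vectors
phases = rng.uniform(0, 2 * np.pi, size=4) # per-component phase shifts
amps = rng.normal(size=4)                  # per-component amplitudes
print(fourier_neuron(np.array([0.5, -1.0, 2.0]), freqs, phases, amps))
```

    By contrast, the classical neuron the abstract refers to computes nonlinearity(w . x + b) for a single weight vector w; the Fourier-like unit replaces that single filtered sum with a superposition of oscillatory components.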